class: center, middle, inverse, title-slide

# Introduction to R
## A crash course
### i42 Testing Team
### 2021-10-21

---

# What we're covering today

- What is **R** and what can you do with it?
- The **basics**: data structures!
- Importing and **wrangling** your data
- A tiny bit of **plotting** if we have time

---
name: colors

# Disclaimer

Note that **all of this is new**. If you have any feedback, or thoughts on other specific trainings we could cover, you know where I live (virtually).

---

# What is R?

- R started out as a language for statistical computing and graphics, but it has grown to be much more than that.
- In that sense it's like Stata. Who's been to the Stata training?

--

.panelset[
.panel[.panel-name[What's similar?]

## Things that are similar

- You open datasets
- You wrangle data
- You analyze it
- You report results
]

.panel[.panel-name[What's different?]

## Things that are different

- You can have multiple datasets
- You have an actual programming language
- You have incredible flexibility
- The world is your oyster (see this prez)
]
]

---

## E.g. Bill depth and bill length are correlated in penguins
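
The figure for this slide is not reproduced here. As a minimal sketch of how one might check the claim, assuming the `palmerpenguins` package (not used elsewhere in this deck) is installed and provides a `penguins` dataset with the columns `bill_length_mm`, `bill_depth_mm` and `species`:

```r
# Assumption: the palmerpenguins package is installed.
# It is not part of the setup described in this training.
library(palmerpenguins)

# Overall correlation between bill length and bill depth,
# using only rows where both measurements are present
overall_cor <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
                   use = "complete.obs")

# The same correlation, computed separately for each species
by_species <- sapply(split(penguins, penguins$species), function(d) {
  cor(d$bill_length_mm, d$bill_depth_mm, use = "complete.obs")
})

overall_cor
by_species
```

Comparing the overall value with the per-species values is a useful habit, since grouping can change the picture.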
---
class: inverse center middle

## What about RStudio?

RStudio is the most commonly used IDE (integrated development environment) for programming with R.

Link to download: https://www.rstudio.com/products/rstudio/download/

---

## Before we start

`R` works with packages. Packages **expand the universe of things** you can do with `R`, often in spectacular fashion.

- To keep things simple, we'll just install the `tidyverse` family of packages and the `gapminder` package, which we'll use to explore the gapminder dataset.
- To install these, just run these two commands in the console:

```r
install.packages("tidyverse")
install.packages("gapminder")
```

- You only need to install packages once on your computer. After that, you just load them when you need them (typically at the beginning of your script).

```r
library(tidyverse)
library(gapminder)
```

---

# Data types

The most basic data structure in R is a vector.

- Vectors have one dimension (think of it as a line)
- Vectors can be **numeric**, **character**, **logical**...
- But all the elements of a vector must share the same type!

Here are some examples:

```r
# You can create a vector with c() and store it using the assignment operator <-
my_letters <- c("a", "b", "c")
my_numbers <- c(1, 2, 3)

# Now let's see what happens if we mix types
all_mixed <- c("a", 2, "b", 3)
```

---

## A vector is a vector is a data frame

And what's a dataset? It's just a bunch of **vectors** stuck together side by side, one per column.

```r
sheep_name <- c("Molly", "Polly", "Dolly")
sheep_weight <- c(120, 90, 85)
sheep_age <- c(3, 4, 2)

my_sheep <- data.frame(sheep_name, sheep_weight, sheep_age)
```

Let's print it out:

```r
my_sheep
```

```
  sheep_name sheep_weight sheep_age
1      Molly          120         3
2      Polly           90         4
3      Dolly           85         2
```

---

## Data importing and wrangling

- We don't always build datasets this way, but it's useful as a **mental model**.
- Typically we do just like in Stata, and we **read in** some data.
- To do that, we will use the family of `read_*` functions available as part of the `tidyverse`.
  You'll find an example at the end of your script, to look at after the training.
- For our training, we'll just use the `gapminder` dataset, which we can load using `data(gapminder)`, since it ships with the `gapminder` package.

```r
data(gapminder)
```

Can you all see it?

---

## Let's talk grammar

There are several verbs available for data wrangling. We'll take a look at these:

- `select()`
- `filter()`
- `mutate()`
- `arrange()`
- `summarize()`
- `group_by()`

With these 6 tools, you should be able to do about 90% of the things you want to do when wrangling data.

---
class: inverse center middle

### For the other 10%: Google + Stack Overflow are your friends.

---

## Taking a look

Let's take a look at our dataset. You can look at the first 6 observations by using `head()`:

```r
head(gapminder)
```

```
# A tibble: 6 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
```

---

## Taking a look

You can also use `summary()` to get an overview of the dataset:

```r
summary(gapminder)
```

```
        country        continent        year         lifeExp
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85
 Australia  :  12                  Max.   :2007   Max.   :82.60
 (Other)    :1632
      pop              gdpPercap
 Min.   :6.001e+04   Min.   :   241.2
 1st Qu.:2.794e+06   1st Qu.:  1202.1
 Median :7.024e+06   Median :  3531.8
 Mean   :2.960e+07   Mean   :  7215.3
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5
 Max.   :1.319e+09   Max.   :113523.1
```

---

## Select variables and filter observations

Use `select()` to pick and choose **variables**, and `filter()` to pick and choose **observations**.
A great thing in `R` is that these operations don't overwrite your original data: you just create a **new, updated object**.

Let's keep only the `country`, `year`, and `lifeExp` variables.

```r
df_redux <- select(gapminder, country, year, lifeExp)

head(df_redux)
```

```
# A tibble: 6 x 3
  country      year lifeExp
  <fct>       <int>   <dbl>
1 Afghanistan  1952    28.8
2 Afghanistan  1957    30.3
3 Afghanistan  1962    32.0
4 Afghanistan  1967    34.0
5 Afghanistan  1972    36.1
6 Afghanistan  1977    38.4
```

---

## Select variables and filter observations

Now let's try using `filter()`. You can filter by any condition! Here we'll filter by year and country.

```r
df_filtered <- filter(gapminder, year == 1957, country == "Belgium")

df_filtered
```

```
# A tibble: 1 x 6
  country continent  year lifeExp     pop gdpPercap
  <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
1 Belgium Europe     1957    69.2 8989111     9715.
```

---

## You can also combine these easily

.pull-left[

- Enter the pipe: `%>%`
- Think of the pipe as "...and then..."
- The pipe allows you to chain operations, and will make your life easier
- Let's try filtering by continent and year, and keeping three variables:

```r
gapminder %>%
  filter(continent == "Oceania",
         year == 2007) %>%
  select(continent, country, lifeExp)
```

```
# A tibble: 2 x 3
  continent country     lifeExp
  <fct>     <fct>         <dbl>
1 Oceania   Australia      81.2
2 Oceania   New Zealand    80.2
```
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#pipe.jpeg" alt="This is not a pipe." width="70%" />
<p class="caption">This is not a pipe.</p>
</div>
]

---

## Reference: `select()`

Select by column name:

```r
gapminder %>%
  select(country)
```

Select by column position:

```r
gapminder %>%
  select(1:2)
```

Select all columns except the first:

```r
gapminder %>%
  select(-1)
```

---

## Mutate to create (or change) variables

- We don't have a GDP variable! Let's create it by multiplying GDP per capita by the population, and save the result as a new data frame.
```r
df_gdp <- gapminder %>%
  mutate(gdp = pop * gdpPercap)
```

OK, those numbers are quite big. We could also divide by 1,000,000 to have GDP in US$ million, or even take the log!

```r
df_gdp <- df_gdp %>%
  mutate(gdp_million = gdp / 1000000,
         gdp_log = log(gdp))
```

---

## Arrange to sort your dataframes

- Let's see which country had the lowest life expectancy in 1957:

```r
gapminder %>%
  filter(year == 1957) %>%
  arrange(lifeExp) %>%
  head(3)
# You can use head() to restrict the output to any number of observations
```

```
# A tibble: 3 x 6
  country      continent  year lifeExp     pop gdpPercap
  <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
1 Afghanistan  Asia       1957    30.3 9240934      821.
2 Sierra Leone Africa     1957    31.6 2295678     1004.
3 Angola       Africa     1957    32.0 4561361     3828.
```

What about the highest?

```r
gapminder %>%
  filter(year == 1957) %>%
  arrange(desc(lifeExp)) %>%
  head(3)
```

```
# A tibble: 3 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Iceland     Europe     1957    73.5   165110     9244.
2 Norway      Europe     1957    73.4  3491938    11654.
3 Netherlands Europe     1957    73.0 11026383    11276.
```

Let's also save the 2007 observations for later:

```r
df_2007 <- gapminder %>%
  filter(year == 2007)
```

---

## Summarize your data

- Very often, we're interested in calculating means, counts, medians, standard deviations.
- These are all **summary stats** and we can calculate them with `summarize()`.
- For example, imagine we're interested in the mean, median, and sd of life expectancy in 2007:

```r
gapminder %>%
  filter(year == 2007) %>%
  summarize(exp_mean = mean(lifeExp),
            exp_median = median(lifeExp),
            exp_sd = sd(lifeExp))
```

```
# A tibble: 1 x 3
  exp_mean exp_median exp_sd
     <dbl>      <dbl>  <dbl>
1     67.0       71.9   12.1
```

---

## Groups of things

- But more often, we care about summarizing across groups! This comes up all the time in our work.
- For example, here are the mean life expectancy and mean GDP per capita by continent in 2007:

```r
df_2007 %>%
  group_by(continent) %>%
  summarize(exp_mean = mean(lifeExp),
            gdp_mean = mean(gdpPercap))
```

```
# A tibble: 5 x 3
  continent exp_mean gdp_mean
  <fct>        <dbl>    <dbl>
1 Africa        54.8    3089.
2 Americas      73.6   11003.
3 Asia          70.7   12473.
4 Europe        77.6   25054.
5 Oceania       80.7   29810.
```

- Note that `group_by()` doesn't produce any output on its own; it just groups things in advance of whatever operation comes afterwards (typically `summarize()`).

---

## Reference: Other useful wrangling functions

- `distinct()` to select unique rows from a data frame
- `top_n()` to filter the top n observations (you can use it in combination with `arrange()`)
- `ntile()` ranks values into the number of groups you provide
- `rename()` to change the name of the variables in your data frame
- All functions: https://dplyr.tidyverse.org/reference/index.html

---
class: inverse middle center

# Plotting

---

## The grammar of graphics

.pull-left[

- **Reminder**: Always, always, ALWAYS plot your data. Please.
- The most popular (and coolest) plotting library in R is `ggplot2`. It follows principles based on the **grammar of graphics**.
- What you should remember about this is that a plot is built from different layers that can be combined to create pretty much anything.
- Let's look at a basic example.
]

.pull-right[
<div class="figure">
<img src="data:image/png;base64,#gog.png" alt="A plot is like an onion." width="2561" />
<p class="caption">A plot is like an onion.</p>
</div>
]

---

## Plotting with `ggplot2`

.pull-left[

### Mappings and geoms

- The two crucial starting points are the `aesthetic mappings` and the `geoms`. These tell your graph which variables are mapped to which visual properties, and what geometry we're going to use to represent them.
- For two variables, a scatterplot is common:

```r
gapminder %>%
  filter(year == 1982) %>%
* ggplot(aes(x = gdpPercap,
*            y = lifeExp)) +
* geom_point()
```
]

.pull-right[
<!-- -->
]

---

## Plotting with `ggplot2`

.pull-left[

### Adding a geom

- Just by playing around with these pieces, we can get very different results.
- We can add a geometry that shows a smooth line (or a linear model).

```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot(aes(x = gdpPercap,
             y = lifeExp)) +
  geom_point() +
* geom_smooth()
```
]

.pull-right[
<!-- -->
]

---

## Plotting with `ggplot2`

.pull-left[

### Adding an aesthetic

- We can also add an aesthetic to encode continent as color.

```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot(aes(x = gdpPercap,
             y = lifeExp,
*            color = continent)) +
  geom_point()
```
]

.pull-right[
<!-- -->
]

---
class: inverse center middle

## Anatomy of a ggplot

---
count: false

.panel1-my_gap-auto[
```r
*gapminder
```
]

.panel2-my_gap-auto[
```
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows
```
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
* filter(year == 1982)
```
]

.panel2-my_gap-auto[
```
# A tibble: 142 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1982    39.9 12881816      978.
 2 Albania     Europe     1982    70.4  2780097     3631.
 3 Algeria     Africa     1982    61.4 20033753     5745.
 4 Angola      Africa     1982    39.9  7016384     2757.
 5 Argentina   Americas   1982    69.9 29341374     8998.
 6 Australia   Oceania    1982    74.7 15184200    19477.
 7 Austria     Europe     1982    73.2  7574613    21597.
 8 Bahrain     Asia       1982    69.1   377967    19211.
 9 Bangladesh  Asia       1982    50.0 93074406      677.
10 Belgium     Europe     1982    73.9  9856303    20980.
# … with 132 more rows
```
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
* ggplot()
```
]

.panel2-my_gap-auto[
<!-- -->
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot() +
* aes(x = gdpPercap)
```
]

.panel2-my_gap-auto[
<!-- -->
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot() +
  aes(x = gdpPercap) +
* aes(y = lifeExp)
```
]

.panel2-my_gap-auto[
<!-- -->
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot() +
  aes(x = gdpPercap) +
  aes(y = lifeExp) +
* geom_point()
```
]

.panel2-my_gap-auto[
<!-- -->
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot() +
  aes(x = gdpPercap) +
  aes(y = lifeExp) +
  geom_point() +
* aes(color = continent)
```
]

.panel2-my_gap-auto[
<!-- -->
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot() +
  aes(x = gdpPercap) +
  aes(y = lifeExp) +
  geom_point() +
  aes(color = continent) +
* aes(size = pop)
```
]

.panel2-my_gap-auto[
<!-- -->
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot() +
  aes(x = gdpPercap) +
  aes(y = lifeExp) +
  geom_point() +
  aes(color = continent) +
  aes(size = pop) +
* theme_ipsum_tw()
```
]

.panel2-my_gap-auto[
<!-- -->
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot() +
  aes(x = gdpPercap) +
  aes(y = lifeExp) +
  geom_point() +
  aes(color = continent) +
  aes(size = pop) +
  theme_ipsum_tw() +
* labs(title = "Income and life expectancy in 1982",
*      x = "GDP per capita",
*      y = "Life expectancy",
*      color = "Continent",
*      size = "Population")
```
]

.panel2-my_gap-auto[
<!-- -->
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot() +
  aes(x = gdpPercap) +
  aes(y = lifeExp) +
  geom_point() +
  aes(color = continent) +
  aes(size = pop) +
  theme_ipsum_tw() +
  labs(title = "Income and life expectancy in 1982",
       x = "GDP per capita",
       y = "Life expectancy",
       color = "Continent",
       size = "Population") +
* scale_color_manual(values = palette_42("i42_bright"))
```
]

.panel2-my_gap-auto[
<!-- -->
]

---
count: false

.panel1-my_gap-auto[
```r
gapminder %>%
  filter(year == 1982) %>%
  ggplot() +
  aes(x = gdpPercap) +
  aes(y = lifeExp) +
  geom_point() +
  aes(color = continent) +
  aes(size = pop) +
  theme_ipsum_tw() +
  labs(title = "Income and life expectancy in 1982",
       x = "GDP per capita",
       y = "Life expectancy",
       color = "Continent",
       size = "Population") +
  scale_color_manual(values = palette_42("i42_bright"))
```
]

.panel2-my_gap-auto[
<!-- -->
]

<style>
.panel1-my_gap-auto {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-my_gap-auto {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-my_gap-auto {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---
class: inverse center middle

# Thank you!